250 ◾ Bioinformatics
of a sequence for detecting the presence of known binding sites of interest. PWM files
of known motifs can be downloaded from motif databases such as JASPAR at “https://jaspar.genereg.net/”.
We will use MAST, which is one of the MEME Suite programs, to search for
known motifs in our sequences. Assume that we wish to search for the TATA binding site in
our example sequences. First, we need to download the motif file from a database and then
run the program as follows:
wget https://jaspar.genereg.net/api/v1/matrix/MA0108.1.meme
mast -mt 5e-02 \
-oc mast_chip1 \
MA0108.1.meme \
chip1_peaks.fasta
Three output files (mast.html, mast.xml, and mast.txt) will be saved in the “mast_chip1”
directory. You can use “firefox mast_chip1/mast.html” to view the results in a browser.
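Before running MAST, it can help to confirm that the downloaded motif file is really in MEME format and that the FASTA file contains sequences. The following is a minimal sketch, assuming the file names used above. The helper names is_meme_format and count_fasta_records are hypothetical and are not part of the MEME Suite.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not part of the MEME Suite.

# Succeed only if the first non-empty line of the file starts with
# "MEME version", which is how a MEME-format motif file begins.
is_meme_format() {
    awk 'NF { ok = ($0 ~ /^MEME version/); exit } END { exit !ok }' "$1"
}

# Print the number of sequence records (header lines starting with ">").
# "|| true" keeps the exit status at 0 when the count is zero.
count_fasta_records() {
    grep -c '^>' "$1" || true
}

# Example use (assumes the files from the commands above exist):
#   is_meme_format MA0108.1.meme || echo "motif file is not in MEME format"
#   count_fasta_records chip1_peaks.fasta
```

If either check fails, the download or the peak-to-FASTA step is worth revisiting before interpreting the MAST output.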
6.4 SUMMARY
Identification of the binding sites of proteins on genomic DNA is critical for understanding
gene regulation, pathways, the role of specific proteins in gene regulation, and their
implications for some diseases. ChIP-Seq is therefore used to study epigenetic changes that
affect gene expression and the impact of such changes on diseases. ChIP-Seq is among the
most effective ways to identify protein-binding sites on genomic DNA. The binding
sites of transcription factors and RNA polymerase II are typically found in the promoter regions of
genes. In a ChIP experiment, the genomic DNA is cut into fragments. The DNA regions,
where the protein of interest binds, are precipitated using a specific antibody. The protein
molecules are then removed from the DNA fragments. The isolated DNA fragments are
then sequenced using one of the available sequencing technologies. DNA library preparation and
sequencing are similar to those of other sequencing applications. The sequence reads (in
FASTQ files) produced by the sequencer come from DNA fragments that are likely to
contain the binding sites for the protein of interest. A quality control step is carried out to
reduce errors and to trim adaptors and other technical sequences that may
affect the analysis results. The cleaned reads are then aligned to a reference genome to produce
BAM files that contain the alignment information of the ChIP reads. The unaligned,
random, and mitochondrial reads are usually removed from the BAM files to reduce the
computational burden. The peak enrichment regions, where the binding sites are most
likely to be found, are called using one of the peak-calling programs. The peak information
for each sample is saved in a BED file. We used R Bioconductor packages to visualize
the distribution of the peaks and to perform annotation and functional analysis, including
GO and KEGG pathway enrichment. GO and KEGG enrichment analyses provide knowledge-based
biological information. Finally, we used motif discovery programs to identify the motifs
in the promoter regions.
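As a small illustration of working with the BED output mentioned above, the number of peaks per chromosome can be tallied with awk. This is a minimal sketch, assuming a tab-separated BED file whose first column is the chromosome name. The function name peaks_per_chrom and the file name in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration: count peaks per chromosome
# in a tab-separated BED file. Header lines (track, browser) and
# comment lines are skipped; output is "chromosome<TAB>count".
peaks_per_chrom() {
    awk -F'\t' '!/^(track|browser|#)/ && NF >= 3 { n[$1]++ }
                END { for (c in n) print c "\t" n[c] }' "$1" | sort -k1,1
}

# Example use (assumes a peak file with this name exists):
#   peaks_per_chrom chip1_peaks.bed
```

A quick tally like this makes it easy to spot leftover mitochondrial or unplaced-contig peaks before annotation.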